A lot of the code was inspired by the following tutorials, which are hereby acknowledged. https://colab.research.google.com/github/pytorch/tutorials/blob/gh-pages/_downloads/cifar10_tutorial.ipynb https://colab.research.google.com/github/pranjalchaubey/Deep-Learning-Notes/blob/master/PyTorch%20Image%20Classification%20in%202020/Image_Classification_practice.ipynb

Install all the packages we need. This is mainly needed for Google Colab and/or Saturn Cloud; since I don't have access to the ECS computers, it might not be needed there.
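The installs might look roughly like the following sketch; the exact package list is an assumption based on the libraries used later in this notebook (torch, torchvision, poutyne).

```shell
# On Colab, torch and torchvision are usually pre-installed,
# so often only poutyne actually needs installing.
pip install --quiet torch torchvision poutyne
```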

Housekeeping stuff... import all the libraries we need and tell matplotlib to make good use of screen space in Jupyter. Try to fix as many sources of randomness as possible for reproducibility, and figure out whether we get a CPU or a GPU to run the job.
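The seeding and device detection could look like this minimal sketch (the seed value is an assumption, not necessarily the one used in the notebook):

```python
import random
import numpy as np
import torch

# Fix as many sources of randomness as practical for reproducibility.
# (Some CUDA kernels remain non-deterministic unless further flags are set.)
SEED = 42  # assumed seed value
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)

# Pick a GPU if one is available, otherwise fall back to CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)
```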

Remove the .ipynb_checkpoints nuisance folder (used by Jupyter for autosave) and check the dimensions of the pictures in the training data set. The assignment description already alluded to the fact that there might be outliers here, so we generate a frequency table of the various dimensions to see what's what.
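A sketch of the frequency table, assuming the training images live under some folder such as `train/` (the path and file extensions are assumptions):

```python
from collections import Counter
from pathlib import Path
from PIL import Image

def dimension_frequencies(root):
    """Tally (width, height) pairs across all images under `root`."""
    counts = Counter()
    for path in Path(root).rglob("*"):
        if path.suffix.lower() in {".jpg", ".jpeg", ".png"}:
            with Image.open(path) as img:
                counts[img.size] += 1
    return counts

# e.g. dimension_frequencies("train")  # "train" path is an assumption
```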

Looks like we have quite the zoo of dimensions. It might be worth checking what those outliers look like compared to the majority of cases. So let's create indices for 300x300 images vs. the rest and plot some images.
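The index split and the plotting could be sketched as follows; `dataset` is assumed to be a torchvision `ImageFolder` (or anything that yields `(PIL image, label)` pairs):

```python
import matplotlib.pyplot as plt

def split_by_size(dataset, size=(300, 300)):
    """Indices of images matching `size` (goodidx) vs. the rest (badidx)."""
    goodidx, badidx = [], []
    for i in range(len(dataset)):
        img, _ = dataset[i]
        (goodidx if img.size == size else badidx).append(i)
    return goodidx, badidx

def show_examples(dataset, indices, n=4):
    """Plot the first `n` images from the given index list."""
    fig, axes = plt.subplots(1, n, figsize=(3 * n, 3))
    for ax, i in zip(axes, indices[:n]):
        img, label = dataset[i]
        ax.imshow(img)
        ax.set_title(str(label))
        ax.axis("off")
```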

Hm... it doesn't really look like we would want these images in our training dataset anyway... they are a bit atypical. Let's compare them with the ones that have the standard dimensions of 300x300.

Yes... those seem to be the real deal. So it's probably best to stick to these pictures and collectively discard the others as "too noisy" outliers. Let's also quickly check whether the classes are balanced.

looks roughly balanced to me.

Defining the image augmentations: deterministically flipping images horizontally, vertically, and both, then adding each of these image sets to the original dataset. All of these are then subjected to random transformations (scaling, rotation) and added to the previous dataset, so we should end up with roughly an eightfold increase in training data. The image normalisation values (mymean and mystd) are taken from prior research done by the image classification community (cf. the origin of these values here: https://github.com/pytorch/vision/pull/1965).

Next, we exclude the bad/noisy pictures (using goodidx to create a Subset). Then we load all the data and split it into a training and a validation set.

Things got a bit tricky with the combination of image augmentation and the training/validation split. You can define only random transformations, which then kick in at random during the various epochs (in both the training and the validation data set). The advantage is that training and validation images remain strictly separated. But I found this approach did not perform as well as creating a whole new training data set in which all sorts of transformations are applied deterministically, plus some at random (as we have done above). The problem with that approach is that in the normal workflow the split into training and validation sets occurs AFTER the creation of the augmented dataset (inside the DataLoader). This leads to heavy contamination of the validation dataset: e.g. an image that exists in the training data set might be rotated by one degree and then end up in the validation data set. Hence I manually made sure that the split into training and validation images carries forward even after transformations and augmentations are applied. I also decided NOT to include any transformed images in the validation dataset, to make sure that data set is as close as possible to a real test set one would encounter "in the wild".
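A minimal sketch of the idea: split the *indices* of the good images (stratified by class) before any augmented copies are concatenated, so the same underlying pictures never land on both sides. The helper name, the validation fraction, and the commented ConcatDataset usage are assumptions:

```python
from sklearn.model_selection import train_test_split

def stratified_split(goodidx, labels, val_fraction=0.03, seed=42):
    """Stratified split of image indices into training and validation indices.

    `goodidx` are the indices of the non-noisy images and `labels[i]` is the
    class of image i (both assumed to exist from the earlier filtering step).
    """
    train_idx, val_idx = train_test_split(
        goodidx,
        test_size=val_fraction,
        stratify=[labels[i] for i in goodidx],
        random_state=seed,
    )
    return train_idx, val_idx

# Augmented copies are then built from the *training* indices only,
# e.g. (sketch):
# train_data = ConcatDataset([Subset(ds_orig, train_idx),
#                             Subset(ds_flip_h, train_idx), ...])
# val_data = Subset(ds_orig, val_idx)  # untransformed images only
```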

We can see that we ended up with 26k training images and roughly 800 validation images. I'm also quickly making sure that I didn't stuff anything up, particularly that the class labels all match the images, and checking that the class distribution is balanced in both indices. (I stuffed that up before by not using stratified sampling.)
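The balance check per index list can be sketched as a small helper (the function name is an assumption):

```python
from collections import Counter

def class_counts(indices, labels):
    """Frequency of each class among the given indices."""
    return Counter(labels[i] for i in indices)

# e.g. print(class_counts(train_idx, labels), class_counts(val_idx, labels))
```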

Looks good... classes are balanced in both the training and the validation data set (indices).

Looking good... moving on to define the training function with the help of poutyne.

Defining a plain-vanilla feed-forward neural network as a benchmark.
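Such a benchmark could look like the sketch below; the hidden-layer size and the number of classes are assumptions (300x300 RGB inputs are taken from the earlier filtering step):

```python
import torch.nn as nn

class BenchmarkMLP(nn.Module):
    """Plain feed-forward benchmark: flatten pixels, one hidden layer."""

    def __init__(self, in_features=3 * 300 * 300, hidden=256, n_classes=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),                 # (B, 3, H, W) -> (B, 3*H*W)
            nn.Linear(in_features, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_classes),  # raw logits, one per class
        )

    def forward(self, x):
        return self.net(x)
```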